1 Introduction


This document outlines a set of metrics for evaluating the diagnostic and prognostic schemes developed 
for the Vehicle Integrated Prognostic Reasoner (VIPR). A number of diagnostic and prognostic metrics 
are defined in the literature (e.g., [1, 2, 3]), but these standards are defined for well-circumscribed algo- 
rithms that apply to small subsystems. VIPR is designed to be a system-level reasoner that encompasses 
multiple levels of large, complex systems such as those for aircraft and spacecraft. The wide variety of 
reasoners employed in such systems ranges from individual line replaceable unit (LRU) health managers to
area health managers (AHM) and the vehicle health manager (VHM) [4]. These health managers are or- 
ganized hierarchically and operate together to derive diagnostic and prognostic inferences from symp- 
toms and conditions reported by a set of diagnostic and prognostic monitors (DMs and PMs) [5]. A brief 
description of the layered VIPR architecture is presented in Section 2 of this document. 

Existing metrics for evaluating fault detection, fault isolation, and prognostics schemes are directly ap- 
plicable to the DMs, PMs, and LRU health managers. For layered reasoners such as VIPR, the overall per- 
formance cannot be evaluated by metrics solely directed toward timely detection and accuracy of esti- 
mation of the faults in individual components. Among other factors, overall vehicle reasoner perfor- 
mance is governed by the effectiveness of the communication schemes between monitors and 
reasoners in the architecture, and the ability to propagate and fuse relevant information to make accu- 
rate, consistent, and timely predictions at different levels of the reasoner hierarchy. 

To address these issues, we outline an extended set of diagnostic and prognostics metrics in this report. 
These metrics can be broadly categorized as evaluation measures for: (1) diagnostic coverage, (2) prog- 
nostic coverage, (3) accuracy of inferences, (4) latency in making inferences, and (5) sensitivity to differ- 
ent fault and degradation conditions. We also discuss possible cost-benefit metrics to capture the im- 
proved performance-to-cost calculations for the VIPR layered architecture. 

Our overall approach involves a systematic study of the effectiveness of the VIPR system using a simulation
testbed designed to generate off-nominal events corresponding to several fault scenarios [5]. A set
of these fault scenarios is documented in a previous report [5]. The evaluation studies involve systemat- 
ic generation of degradation and fault data, realistic analysis using the monitors and reasoners in the 
VIPR architecture, and a methodology to compute the values for the chosen metrics using the perfor- 
mance data collected from the software testbed. Benchmarking VIPR makes it possible to assess how it 
can increase aviation safety. To achieve a benchmark, we evaluate VIPR's performance using the diag- 
nostics and prognostics metrics described in Section 4. 

Section 7 discusses the Hardware-in-the-Loop demonstration and reports on additional metrics extract- 
ed from a VIPR demo and test configuration that now includes a piece of avionics equipment, Honey- 
well's LaserRef VI Inertial Reference Unit (IRU). 

The rest of this report outlines the VIPR architecture, the simulation testbed for benchmarking studies, 
the metrics chosen for diagnostic and prognostics reasoner evaluation, and the summary and conclu- 
sions for this report. 
2 VIPR Architecture and Functionality

The primary function of VIPR is to detect and isolate faults and failures at the aircraft level. A simplified 
functional view of VIPR is shown in Figure 1. VIPR is organized into a hierarchical architecture. At each 
level or layer of the hierarchy, the VIPR processing blocks maintain relationships with other blocks at the 
same level. 



Figure 1: Functional View of VIPR 


To satisfy bandwidth and power constraints [4, 5], only a subset of messages is allowed to flow from one
level to another. At the lowest level, LRU HMs receive measurements from the sensors and perform
diagnostic and prognostic (DP) monitoring tasks to compute DP indicators. The next level is organized
into multiple AHMs that follow the principal spatial and temporal decomposition of the aircraft
functionality and behavior. The main task of an AHM is to perform DP reasoning using the indicators
provided by the LRU HMs. Finally, a VHM is responsible for collecting the data from all AHMs and
resolving any ambiguities with the assistance, if necessary, of off-vehicle health management services.
In the following, we first describe the candidate algorithms that can be used for VIPR and then discuss
the information flow between the various levels of VIPR.
LRU Health Manager


At the LRU level, the objective is twofold: (1) discover information that can be used for DP reasoning in 
raw and noisy measurements by performing feature extraction and (2) compress the data so that they 
can be efficiently transmitted and used by the higher levels for more integrated analyses (e.g., reasoning 
about the effects of fault propagation between subsystems). Both tasks are accomplished by a suite of 
DP monitors that can be classified into two categories: (1) simple DP monitors and (2) advanced DP 
monitors. 

Simple DP monitors test whether a sensor measurement or measurement rate exceeds a threshold. All
major subsystems in an aircraft have built-in tests (BIT) that perform such operations and represent the
simplest form of feature extraction, generating binary health indicators. Mathematically, these tests are
based on well-defined detectors such as the likelihood ratio test, z-test, and t-test. In addition to such
algorithms, advanced DP monitors are used for discovering and extracting information from multivariable
measurement sets.
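As an illustration of the simple DP monitors described above, the following is a minimal Python sketch, not the VIPR implementation. The function names, the limits, and the critical value `z_crit` are assumptions chosen for the example; each function returns a binary health indicator.

```python
import math

def threshold_monitor(samples, dt, value_limit, rate_limit):
    """Return 1 if any sample or sample-to-sample rate exceeds its limit, else 0."""
    for i, x in enumerate(samples):
        if abs(x) > value_limit:
            return 1
        if i > 0 and abs(x - samples[i - 1]) / dt > rate_limit:
            return 1
    return 0

def z_test_monitor(samples, nominal_mean, nominal_std, z_crit=3.0):
    """Return 1 if the sample mean deviates from nominal by more than z_crit standard errors."""
    n = len(samples)
    mean = sum(samples) / n
    z = (mean - nominal_mean) / (nominal_std / math.sqrt(n))
    return 1 if abs(z) > z_crit else 0
```

For example, `threshold_monitor([1.0, 2.0], dt=1.0, value_limit=5.0, rate_limit=0.5)` fires because the rate of change (1.0 per second) exceeds the assumed rate limit even though neither value exceeds the value limit.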

A representative algorithm for such a monitor is principal component analysis (PCA), which transforms a
number of possibly correlated variables into a smaller number of uncorrelated variables called principal
components. After using multivariate signal processing algorithms, advanced DP monitors can use simple
classification and trending algorithms to encapsulate information into DP indicators that are forwarded
to the AHMs. Candidate algorithms include bin classifiers, nearest-neighbor classifiers, and
discriminant analysis. The computed indicators include: (1) condition indicators that can describe, for
example, the engine compressor efficiency and spectral energy content from a vibration signal; (2) health
indicators that capture, for example, inlet filter, compressor rub, or foreign object damage (FOD)
incidents; and (3) prognostic indicators that show, for example, the evolution of engine health for the next
100 hours of a specific mission.
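A nearest-neighbor classifier, one of the candidate algorithms named above, can be sketched as follows. This is only an illustration: the two features (compressor efficiency and vibration energy), the example values, and the indicator labels are hypothetical and are not taken from the VIPR reference models.

```python
import math

def nearest_neighbor_indicator(features, labeled_examples):
    """Return the label of the labeled example closest (Euclidean) to `features`."""
    best_label, best_dist = None, math.inf
    for example, label in labeled_examples:
        dist = math.dist(features, example)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical examples: (compressor efficiency, vibration energy) -> health indicator
EXAMPLES = [
    ((0.90, 0.10), "nominal"),
    ((0.75, 0.15), "compressor_rub"),
    ((0.88, 0.60), "foreign_object_damage"),
]
```

Calling `nearest_neighbor_indicator((0.76, 0.20), EXAMPLES)` returns the label of the closest example, which would then be forwarded to the AHM as a health indicator.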

Area Health Manager 

AHMs conduct DP reasoning for aircraft subsystems, including multiple LRUs. They are organized along 
spatial and temporal boundaries of fault manifestation and propagation to minimize the communication 
between aircraft subsystems. Since perfect containment of a fault in one area cannot be guaranteed, 
AHMs can query remote LRUs if necessary. Candidate algorithms at this level include decision trees, dis- 
crete event system diagnosers, failure propagation graphs, neural networks, fuzzy classifiers, and Bayes- 
ian networks. Heterogeneous reasoners deal with binary, discrete, and continuous indicators provided 
by the LRUs as well as with event-driven and time-driven dynamics of the underlying aircraft compo- 
nents. 

Vehicle Health Manager 

The VHM is responsible for reasoning across spatial and temporal boundaries of the various areas and 
possibly uses off-vehicle health management services. The VHM resolves ambiguities that may arise 
from the AHMs, initiates additional DP tests, and provides warnings. The DP reasoning technologies are 
similar to those in the AHMs, but special care must be taken to deal with multiple temporal scales. 

Reference Models 

Managing and evaluating the operation of VIPR requires a database of the health management components
that are available at the LRU, area, and vehicle levels. Every component is associated with a reference
model (implemented as an XML file) that captures the interface of the component and information
about the internal functionality if available. At the very least, it defines the input-output relations for the
associated component.

Figure 2 and Figure 3 show hierarchical and central/flat reasoning. Both reasoners use corresponding
reference models that define the monitor sources and diagnostic relationships. In the hierarchical
reference model, the LRU reference model defines all the monitors at the LRU level as well as the bipartite
graph that connects the monitors to the LRU failure modes. The fault cascades and common causes are
captured at the area level along with the faults in area-level systems such as the fuel lines and electrical
buses. At the vehicle level, the reasoner model captures the relationships between the area-level faults
and defines the inhibitors to suppress nuisance faults.

VIPR uses a reference model designed to support a VIPR-type hierarchical system. In such a system,
monitor generation and active query initiation resolve the LRU-level ambiguity. The area-level
reasoner uses the area-level reference model to isolate faults and discover fault cascades. In contrast,
the central reasoners use a flat reference model. In this type of reasoning, the monitors may be
generated at the LRU level. All other functions of the reasoner, including query, isolation, cascade
reasoning, fusion, and inhibits, are performed at the vehicle level. The VIPR reasoner code is designed to
use either reference model for reasoning, which allows comparison of results from the distributed
(hierarchical) and central (flat) reasoners.



Figure 2: Hierarchical Reasoning 
Figure 3: Central Reasoning


Information Flow 

VIPR's basic information flow starts from the sensors that communicate raw measurements to the LRU 
HMs. The LRU HMs compute DP indicators using simple or advanced DP monitors and send them to the 
AHMs. The DP reasoners in the AHMs generate fault candidates that are sent to the VHM. At the vehicle 
level, reasoners generate detections and predictions of failure modes and advisories. 

In addition to communicating the output of the health management components at this level, VIPR
considers an enhanced information flow that complements the component results with metadata that
provide valuable information about how these results have been computed. The metadata communicated
instantiate the attributes of the reference model of the corresponding component in the VIPR
architecture, constructing an accurate runtime representation of the VIPR configuration.

The information flow can then follow two paradigms. First, low-level components can forward important 
messages to higher levels upon detection, for example, of adverse events. Second, high-level compo- 
nents can actively query low-level components for specific information that can be used to disambiguate 
fault candidates or improve fault prediction. In addition, VIPR supports active sensor tests that are in- 
voked on demand. 


Evaluation 

Evaluation of a vehicle-level health management architecture such as VIPR must assess how it can
increase aviation safety. This goal is directly linked to the following measures:

1. Diagnostic coverage 

2. Prognostic coverage 

3. Accuracy 

4. Latency 

5. Sensitivity 

Given VIPR's hierarchical architecture, the benchmarking process must consist of two steps: 

1. Quantifying the effectiveness of each VIPR layer in terms of the above metrics

2. Determining the accumulated inaccuracies as information is passed up the architecture

Well-defined metrics can be used to evaluate VIPR performance (see next section). In addition to
such measures, it is important to evaluate the efficiency and scalability of VIPR in terms of the computa- 
tional resources needed as well as to quantify the trade-offs between performance and resource usage. 
The enhanced information flow and the active querying described above, for example, can improve per- 
formance and increase safety but they require increased computation and communication capabilities. 
Cost-benefit analysis is then necessary to determine the optimal VIPR configuration. The following sec- 
tion describes the metrics to be computed along with the discussion on the cost analysis. 

3 VIPR Metrics 

This section defines which summary statistics we generate from the algorithm performance and
message logs and how we generate them, and describes the Monte Carlo experiments run to exercise
the reasoner and generate performance and message logs.

A number of diagnostic and prognostic metrics exist, but these standards are defined for well- 
circumscribed algorithms that apply to small subsystems. For layered reasoners such as VIPR, the overall 
performance cannot be evaluated by metrics directed solely toward timely detection and accuracy of es- 
timation of the faults in individual components. Among other factors, the overall vehicle reasoner per- 
formance is governed by the effectiveness of the communication schemes between monitors and hier- 
archical reasoners and the ability to propagate and fuse relevant information to make accurate, con- 
sistent, and timely predictions at different levels of the reasoner hierarchy. An added functionality of 
this architecture is the ability of the vehicle- and area-level reasoners to generate specific queries for the 
component monitors. To address these issues, we have developed an extended set of diagnostic and 
prognostics metrics that can be used to evaluate the performance of the layered architecture. The met- 
rics are summarized in the following sections. 

3.1 Accuracy and Computation Metrics 

3.1.1 Accuracy Metrics 

Generation of the reasoner accuracy metrics: The reasoner accuracy metrics are generated by running 
the reasoner in tandem with the fault simulator. The fault simulator works as an evidence source that 
generates the monitors for the seeded faults. The reasoner accuracy metrics are captured directly from 
the MATLAB® run. It is not necessary to run the GUI to capture the reasoner accuracy metrics; however, 
the reasoner accuracy metrics are calculated for all runs, including the non-metrics collection runs from 
the GUI and are displayed on the MATLAB console at the end of the reasoner run. 

The following data is collected to capture the reasoner accuracy metrics: 

• Simulated fault, captured for metrics computation.

• List of all monitors that fired and the time of first firing.

• Diagnostic accuracy, which is the accuracy of the reasoner's diagnostic conclusions. It is
computed by comparing the final reasoner conclusion against the simulated faults for all
scenarios. The diagnostic accuracy captures the following sub-metrics:

o Rate of false alarms: given the absence of faults, detecting faults when no faults are
present.

o Rate of true detects: given the presence of faults, detecting the faults that are
present.

o Rate of false detects: given the presence of some faults, detecting faults that are not
present/simulated in the scenario.

o Rate of missed detects: given the presence of some faults, missing the detection of a
simulated fault.

• Prognostic accuracy: accuracy of the prognostics. This metric captures the fusion of two or
more prognostic vectors and the discovery of precursors through data mining.

• Time to detect, measured from the first appearance of the indicting monitor.

• Time to isolate two or more faults. This is also measured from the time of initiation of the first
set of monitors corresponding to the simulated faults.

• Detection rate for intermittent faults. The types of intermittent fault are shown in Figure 4. For
accuracy metrics, we concentrate on the intermittent evidence leg.

[Figure 4 diagram: intermittent faults branch into evidence (non-latching, overlong period) and
failure mode (within flight, across flight).]

Figure 4: Types of intermittent faults

• Isolation layer/reasoner: This metric records, for every simulated fault, which reasoner isolated
the fault. The conjecture is that complex faults are isolated by higher-level reasoners (such as
the AHM or VHM reasoners), and this metric is intended to test that hypothesis.
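The four diagnostic accuracy sub-metrics listed in this section can be sketched as below. This is a minimal illustration rather than the report's analysis script: the function name, the input layout, and in particular the choice of denominators (no-fault runs for false alarms, simulated faults for true and missed detects, reported faults for false detects) are assumptions.

```python
def accuracy_rates(runs):
    """runs: list of (simulated_faults, reported_faults) pairs, each a set of fault names."""
    no_fault_runs = [r for r in runs if not r[0]]
    fault_runs = [r for r in runs if r[0]]

    false_alarms = sum(1 for _, reported in no_fault_runs if reported)
    simulated = sum(len(sim) for sim, _ in fault_runs)
    true_detects = sum(len(sim & rep) for sim, rep in fault_runs)
    reported = sum(len(rep) for _, rep in fault_runs)
    false_detects = sum(len(rep - sim) for sim, rep in fault_runs)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "false_alarm_rate": ratio(false_alarms, len(no_fault_runs)),
        "true_detect_rate": ratio(true_detects, simulated),
        "missed_detect_rate": ratio(simulated - true_detects, simulated),
        "false_detect_rate": ratio(false_detects, reported),
    }
```

Each pair would be filled from the logged simulated fault and the final reasoner conclusion for one run; the no-fault baseline runs contribute only to the false alarm rate.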

This study is repeated for a select set of faults with multiple reasoner parameters, such as: 

• Threshold for splitting/merging 

• Threshold for acceptance/rejection

• Threshold for isolation/ambiguous 

• Threshold for fault condition closeout 

3.1.2 Computational Metrics 

Communication and profiling data is collected at run time to compute the complexity cost. The commu- 
nication metrics are utilized to compare the distributed layered reasoner architecture with the central 
reasoner architecture. In the distributed reasoner, the computations occur at all the layers and only the 
conclusions, active queries, and broadcasts are sent out to the next reasoner level. In the central 
reasoner, all the computations and disambiguation occur at the central reasoner.
To better understand the communication tradeoffs, the following data is collected:

• Communication costs: Communication costs are computed from the total number of communi- 
cations and the bandwidth utilization required to isolate a fault by the reasoner in a given archi- 
tecture. Computation begins at the time of fault inception. To compute the communication 
costs, these data items need to be logged: 

o Message source ID 
o Destination ID 
o Timestamp 
o Message number 
o Packet type 
o Packet subtype 
o Payload size 

Message cost can be described by a cost function that is proportional to the payload size
sent from the source layer to the destination layer.

• Message delays can be incorporated in the communication cost through post analysis. For ex- 
ample, for a message from an LRU reasoner to the AHM, the communication delay could be 
modeled with a bounded delay. 

• Bandwidth utilization is computed by assuming that VIPR is allotted a fixed maximum
percentage of the communication bandwidth and then computing the latency implied by that
bandwidth limitation. For example, the communication bandwidth can be assumed to be 10 Mbps,
and VIPR can be assumed to be limited to, at most, 1% of the communication bandwidth. The
bandwidth utilization computation compares the communications in both the central and
distributed reasoner architectures.

• At the reasoner level, the following information is also collected: 

o CPU execution times 
o CPU utilization 
o Memory utilization 
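The communication cost and bandwidth-limited latency computations described in this section can be sketched as follows. The record fields mirror the logged data items listed above; the class and function names, the assumption that payload size is logged in bytes, and the unit cost per byte are illustrative choices, not part of the VIPR code.

```python
from dataclasses import dataclass

@dataclass
class Message:
    source_id: str
    destination_id: str
    timestamp: float      # seconds of simulated time
    message_number: int
    packet_type: str
    packet_subtype: str
    payload_bytes: int    # assumed unit: bytes

def communication_cost(log, fault_time, cost_per_byte=1.0):
    """Sum a payload-proportional cost over messages sent at or after fault inception."""
    return sum(cost_per_byte * m.payload_bytes for m in log if m.timestamp >= fault_time)

def transfer_latency_s(payload_bytes, link_bps=10e6, vipr_share=0.01):
    """Latency implied by limiting VIPR to a fixed share of the link bandwidth."""
    return payload_bytes * 8 / (link_bps * vipr_share)
```

With the example figures from the text (10 Mbps link, 1% share), VIPR has 100 kbps available, so a 1,250-byte payload would take about 0.1 seconds to transfer. Running the same functions over the logs of the central and distributed reasoners gives a like-for-like comparison.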

3.2 Monte Carlo Experiments 

We use Monte Carlo experiments to exercise the reasoner and generate performance and message logs. 

The objective of these experiments is to generate the: 

• Accuracy metrics 

• Computation metrics 

The accuracy metrics are generated by running the reasoner using the complete VIPR reference model
along with the fault simulator under the following combinations of conditions:

• Number of faults:

o No-fault baseline

o All possible single complex faults

o All combinations of two complex fault conditions

• Fault types:

o Latched faults

• Evidence/monitors:

o Random evidence coming on line, i.e., the monitors fire based on stochastic probabilities

• Evidence type:

o Latched monitors

o Chattering monitors (only for non-indicting monitors)

• Special cases:

o Two or more prognostic monitors need to be fused to generate the accuracy metrics
o Multiple fault cases that lead to ambiguity in the reasoner conclusions

• Time of inception of fault:

o 200 seconds into flight

• Reasoner parameters: Repeated simulations for the selected set of faults with multiple reasoner 
parameters, such as: 

o Threshold for splitting (3 thresholds) 
o Threshold for merging (3 thresholds) 

The computational metrics are collected over all faults for the small reference model in both its
flattened and layered forms. We consider all 1-fault and 2-fault combinations and study only latched fault
and monitor states. The effect of simultaneous-versus-staged evidence is also studied. The
communications costs are logged and analyzed further by fitting multiple communication delay/cost models.

4 VIPR Evaluation Approach 

The VIPR evaluation approach is illustrated in Figure 5. We used a regional airline database to enhance
the reference model and to generate monitors to test VIPR. The reference model is also an input
parameter to the failure mode simulator. The failure mode simulator generates evidence streams that
correspond to a selected failure mode from the reference model. The evidence stream is then fed to
the VIPR reasoner. Reasoner outputs such as isolated faults, detected faults, time of isolation, and
isolating reasoner are logged and analyzed by the metrics analysis scripts. The metrics analysis scripts
summarize the results and calculate false alarm rates, true detect rates, average time to isolate, number
of reasoner messages, etc., which are then used for reasoner and reference model improvements.



Figure 5: VIPR Evaluation Approach 
Figure 6 shows four places in the overall system flow of information where inputs to the reasoning
process can greatly affect the quality of the reasoner's results.
Figure 6: VIPR Data Flow Showing Four Opportunities for Evaluating Results and Refining Inputs


(1) is the reference model describing the vehicle; clearly, it needs to be an accurate reflection of
the vehicle.

(2) is the quality of the evidence stream; low-fidelity evidence will erode the reasoner's ability to
accurately isolate a fault.

(3) indicates the settings that tune the reasoner's operation, such as the confidence threshold for
declaring a fault isolated.

(4) measures the reasoner's effectiveness and uses those results to fine-tune the reference model.

We evaluated the sensitivity of the reasoner to its settings and report the results in Section 6.2.2. The
other three evaluations need to be done in the context of a specific vehicle, a high-fidelity reference
model, and actual evidence streams.


5 Failure Mode Simulator 

We developed the failure mode simulator to enable exhaustive testing of the VIPR reasoner because it is
not possible to extract all types of failure modes from the regional airline database. The failure mode
simulator uses the reference model to generate the monitor evidence for the simulated faults. The
simulator uses stochastic processes for setting monitor firing and monitor activation times, and for
generating false alarms.
Figure 7 and Figure 8 show the reference model in graphical and textual format. Figure 9 shows the
three thresholds that are used to generate the diagnostic monitors. The three thresholds are compared
against uniformly distributed random numbers to simulate the stochastic nature of the monitors. The
first threshold, the Monitor Valid threshold, is an exponential threshold that captures the behavior of
monitors as they come on-line at the beginning of the flight. This threshold is used to generate an
invalid/cannot compute (represented as -1) monitor response. The false alarm threshold is used to
generate false monitor firing throughout the simulation window. The fault detect threshold becomes
active only after fault injection. It is used to decide whether or not the corresponding monitor fires
when the fault is present (injected). Both the false alarm and the detection thresholds can be read off
the reference model as highlighted in Figure 7.

The following steps capture the working of the fault simulator and monitor/evidence generation. It is
assumed that the simulated failure mode is represented in the reference model. Then:

1. Find all monitors (M) in the reference model.

2. Simulate an array of M times T random samples, where T is the number of frames of evidence to be
generated.

3. The monitor generation program includes some additional logic that first determines the
validity of the monitor. The validity is determined by checking whether the simulated sample value is
greater than the Monitor Valid threshold. If this condition is not met, then Mj is set to -1. If the
monitor is valid, then the monitor false alarm threshold or the fault detect threshold is used to set the
monitor Mj = 0 or 1.

4. Once sample time >= fault injection time:

a. Get all evidence of interest (monitors) and their detection probabilities.

b. Find whether there are any prognostic monitors associated with the diagnostic monitors.
Simulate the time of issue for the prognostic monitors. Simulate the prognostic vector.

c. For every indicting diagnostic monitor, check whether the sample for the evidence of interest
satisfies Mj > 1 - dij; if so, set Mj = 1, otherwise 0.

5. The time at which Mj first exceeds 1 - dij is set as the monitor time stamp.
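The monitor generation steps above can be sketched for a single indicting diagnostic monitor as follows. This is a simplified illustration under stated assumptions, not the report's simulator: the function and parameter names are hypothetical, the Monitor Valid threshold is assumed to decay as exp(-t/mean), validity is drawn independently in each frame, and prognostic monitors (step 4b) are omitted.

```python
import math
import random

def simulate_monitor(num_frames, dt, fault_frame, detect_prob,
                     false_alarm_prob, mean_activation_s, rng):
    """Simulate one indicting diagnostic monitor.

    Returns (values, time_stamp): values[t] is -1 (invalid), 0, or 1, and
    time_stamp is the first frame at or after fault_frame where the monitor fired.
    """
    values, time_stamp = [], None
    for t in range(num_frames):
        # Monitor Valid threshold (assumed form): decays exponentially from 1.
        valid_threshold = math.exp(-(t * dt) / mean_activation_s)
        if not rng.random() > valid_threshold:
            values.append(-1)  # invalid / cannot compute
            continue
        # False alarm threshold before injection, fault detect threshold after.
        p_fire = detect_prob if t >= fault_frame else false_alarm_prob
        fired = 1 if rng.random() > 1.0 - p_fire else 0
        if fired and t >= fault_frame and time_stamp is None:
            time_stamp = t
        values.append(fired)
    return values, time_stamp
```

A call such as `simulate_monitor(200, 10.0, 20, 0.9, 0.001, 20.0, random.Random(1))` would correspond to a 2,000-second run in 10-second frames with the fault injected at 200 seconds; changing the seed gives a different evidence stream for the same scenario.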

Figure 10 shows the sample diagnostics and prognostics vectors. Simulations are run multiple times us- 
ing different random number seeds. 
Figure 9: (a) Three thresholds for monitor generation; (b) Diagnostics monitors

Figure 10: Types of evidence generated by the failure mode simulator


5.1 Selection of Failure Modes for Simulation 

The failure modes to be simulated by the simulator are selected only if they satisfy the following criteria 
for complex failure modes (Figure 11): 

1. Select monitors that indict two or more failure modes. Monitors that are linked to a single fault 
have been eliminated because they do not generate interesting tests. 

2. Select failure modes that have two or more indicting monitors. 
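The two selection criteria above can be applied to the monitor-to-failure-mode graph roughly as follows. This is a sketch: the function name and input layout are hypothetical, and it assumes that the second criterion counts only the monitors retained by the first.

```python
def select_complex_failure_modes(indicts):
    """indicts: dict mapping monitor name -> set of failure modes it indicts."""
    # Criterion 1: keep only monitors that indict two or more failure modes.
    shared = {m: fms for m, fms in indicts.items() if len(fms) >= 2}
    # Criterion 2: keep failure modes indicted by two or more retained monitors.
    counts = {}
    for fms in shared.values():
        for fm in fms:
            counts[fm] = counts.get(fm, 0) + 1
    return sorted(fm for fm, n in counts.items() if n >= 2)
```

For instance, with monitors `M1 -> {A, B}`, `M2 -> {A, B, C}`, and `M3 -> {C}`, only failure modes A and B qualify, because C has a single shared indicting monitor.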

The hierarchical and flat reference models each defined these 22 complex fault conditions: 

• APU 

o EC Blade Rub 
o Fuel Metering Fault 
o Starter Fault 
o Igniter Assembly Fault 
o Turbine Erosion 
o Nozzle Clogging 
o ECB Fault 
o No Fuel 
o Bearing Fault 
o Inlet Blocked 
• Engine

o Al Stuck Valve 
o Fan/LPC Degradation 
o HPC Degradation 
o AC Duct Rupture 
o FADEC Fault
o Fuel Metering Fault 
o FI PT Degradation 
o Igniter Fault 
o Inlet Fouling 
o Shutoff Drain Valve Fault 
o Starter Fault 
o Nozzle Clogged 

The simulations are configured to exercise only the complex failure modes, which ensures that the
statistics are not skewed by the simple faults, which are easy to detect and isolate.
Figure 11: Failure mode to detection matrix

6 Profiling the Reasoner

6.1 Overview 

Figure 12 illustrates the overall approach for measuring the performance of the VIPR reasoner. 



[Figure 12 diagram; surviving block labels: VIPR Reasoner, Metrics Analysis.]


Figure 12: Components in the VIPR metrics analysis


For a given aircraft fault scenario, the failure mode simulator was used to generate an evidence stream 
that contained symptoms of the inserted faults as well as randomly generated evidence for spurious 
faults. Using the generated evidence stream, the VIPR reasoner was then executed for both the hierar- 
chical and flat aircraft reference models. During its execution, the reasoner logged information about its 
execution that was analyzed by an offline tool that computed the accuracy, performance, and cost met- 
rics reported in this section. 

6.2 Metrics Generation Protocol 

The hierarchical and flat aircraft reference models define more than 40 failure modes. Of these, 22 fail- 
ure modes were considered complex (see Section 5.1). 

We simulated single and multiple fault scenarios. The 22 single-fault scenarios were each simulated 10
times using different evidence streams, each containing 0.1% randomly generated false evidence.
Therefore, the single-fault data for each reference model was generated from the simulated insertion of
220 faults.


The 22 faults combine into 231 sets of double faults, and each two-fault combination was simulated
once. Therefore, the multiple-fault data was generated from the simulated insertion of 462 faults, two
insertions for each of 231 test cases.


This means that the analysis was performed over data accumulated from running each reference model 
over 451 evidence streams that simulated 682 failure modes. 
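The scenario counts above can be reproduced by enumerating the test plan directly. The sketch below uses placeholder failure mode names (`FM01` to `FM22`) rather than the actual fault list, and the function name is an assumption.

```python
from itertools import combinations

def build_test_plan(failure_modes, single_repeats=10):
    """Return one tuple of inserted faults per evidence stream."""
    singles = [(fm,) for fm in failure_modes for _ in range(single_repeats)]
    doubles = list(combinations(failure_modes, 2))
    return singles + doubles

plan = build_test_plan([f"FM{i:02d}" for i in range(1, 23)])
streams = len(plan)                        # 220 single + 231 double = 451
inserted = sum(len(s) for s in plan)       # 220 + 2 * 231 = 682
```

The double-fault count follows from choosing 2 of 22 faults, 22 × 21 / 2 = 231.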
6.2.1 Failure Mode Simulator Parameters

We ran all failure mode test cases for a simulated time of 2,000 seconds, with evidence of faults provid- 
ed in 10 second increments. Fault insertion, for both single and multiple fault scenarios, always occurred 
at simulated time 200, and randomly generated erroneous evidence could appear at any time during the 
2,000 second simulation. 

We assumed that the vehicle was able to completely process a set of evidence from LRU HM through 
the VHM before the arrival of the next set of evidence 10 seconds later. That is, we assumed that pro- 
cessing latency was always less than 10 seconds. 

Each diagnostic monitor was supplied with false information 0.1% of the time. The aircraft reference
models each contain about 40 diagnostic monitors, and during a simulation, each monitor is supplied
with 200 values (2,000 seconds divided by the 10-second time increment). The generation of 8,000
monitor values over a test case means that, on average, eight monitor values (0.1% of 8,000) were caused
by false information.

At simulation start, all monitors were considered inactive and became active at an exponential rate with
a mean of 20 seconds. This means that by 20 seconds into the simulation, roughly 63% of the monitors
were expected to be active, and by the fault insertion time of 200 seconds, there was a very high
probability that all monitors were active.
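For an exponentially distributed activation time with a mean of 20 seconds, the activation probabilities can be worked out as below. The calculation assumes that the roughly 40 monitors activate independently, which the report does not state explicitly.

```latex
\begin{align*}
P(\text{active by } t) &= 1 - e^{-t/\tau}, \qquad \tau = 20\ \text{s} \\
P(\text{active by } 20\ \text{s}) &= 1 - e^{-1} \approx 0.632 \\
t_{1/2} &= \tau \ln 2 \approx 13.9\ \text{s} \\
P(\text{active by } 200\ \text{s}) &= 1 - e^{-10} \approx 0.99995 \\
P(\text{all 40 active by } 200\ \text{s}) &= \left(1 - e^{-10}\right)^{40} \approx 0.998
\end{align*}
```

Under these assumptions, half of the monitors are expected to be active after about 14 seconds, and the chance that any monitor is still inactive at fault insertion is about 0.2%.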

6.2.2 Reasoner Parameters 

The reasoner contains several parameters for tuning its operation. We focused on the DELTAJ parame- 
ter, which affects the reasoner's fault isolation sensitivity. With too high a value, the reasoner may have 
difficulty isolating the inserted faults, while too low a value may cause it to incorrectly isolate faults that 
were not inserted. 

When the reasoner receives evidence of a failure, it computes the likelihood for each fault condition 
that could have caused the evidence. One criterion for achieving fault isolation is for the likelihood of 
the isolated fault to be much higher than the likelihood for other faults. DELTAJ is the measure for how 
much higher. For lower values, the reasoner requires less evidence to indict a failure condition. Conse- 
quently, the reasoner is faster at isolating failure modes but more likely to indict a failure condition that 
does not exist. 

DELTAJ is expressed as the base-10 logarithm of the ratio of the likelihoods. Therefore, setting
DELTAJ = 2 means that to achieve isolation, the fault's likelihood must be at least 100 times greater
than the likelihoods of the other faults.
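The DELTAJ criterion described above can be sketched as a check on the likelihoods within one ambiguity group. The function name and input layout are assumptions, and the sketch covers only this single criterion; the reasoner may apply others before declaring isolation.

```python
import math

def isolated_fault(likelihoods, delta_j):
    """Return the fault meeting the DELTAJ margin over all others, or None.

    likelihoods: dict mapping fault name -> likelihood within one ambiguity group.
    """
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return None
    if len(ranked) == 1 or ranked[1][1] <= 0:
        return ranked[0][0]
    (top, p_top), (_, p_next) = ranked[0], ranked[1]
    # Base-10 log of the likelihood ratio against the strongest competitor.
    return top if math.log10(p_top / p_next) >= delta_j else None
```

With likelihoods of 0.99 and 0.005, the ratio is 198, so the leading fault is isolated for DELTAJ = 2 (which requires a ratio of at least 100) but not for DELTAJ = 3 (which requires at least 1,000).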

We tried three values of DELTAJ (1.5, 2.5 and 3.0) and determined that for these aircraft reference 
models and our analysis protocol, the 3.0 value produced the best results: fewest incorrect indictments 
without a significant drop-off in indicting the inserted faults. 

6.3 Accuracy Analysis 

The accuracy analysis was based on the VIPR reasoner's final state after each 2,000 second simulation. 

As the reasoner receives evidence of failure modes, it builds ambiguity groups consisting of a group 
state and a set of failure modes. The likelihood that the particular failure condition is the cause of the 
evidence is associated with each failure mode. These likelihoods are relative to the other failure modes 
listed in the same ambiguity group. The likelihood of a failure condition in one ambiguity group cannot 
be compared to the likelihood of a failure condition in a different ambiguity group. 


For analysis, ambiguity groups were in one of these two states at the end of the simulation: 


Isolated 

Isolated is a terminal state. In this state, the reasoner has completed analysis for the ambigu- 
ity group and has identified a failure mode, which is the only failure condition remaining in 
the ambiguity group. Additional evidence will not improve the solution. 

Waiting 

In this state, the reasoner has received sufficient evidence to form the ambiguity group con- 
taining one or more potential failure conditions, but the evidence is not sufficiently strong to 
indict any one of them. Typically, the likelihood of one of the faults is very high, but not high 
enough to meet the DELTAJ threshold. 


In an operational environment, we expect the reasoner results to be presented to an aircraft technician 
troubleshooting the fault in three lists: 


• The isolated faults 

• The faults detected with high likelihood (but not isolated) 

• Remaining fault conditions reported by the reasoner (fault conditions listed in an ambiguity group 
but with low likelihood) 

We expect the technician to use this information to develop a troubleshooting strategy by subjectively 
comparing the value of isolated and detected faults and integrating this information with specific 
knowledge about the aircraft. 

We expect the technician to focus on the isolated fault conditions first and use information about the 
other reported fault conditions only when either the isolated fault list is empty or when investigation of 
the isolated faults does not lead to the malfunction. 

We used the following paradigm for computing the accuracy metric. If the reasoner isolated one or 
more faults, we computed accuracy using just the results for the isolated faults and ignored the results 
for faults that were detected but not isolated. However, if no faults were isolated, then we based the 
accuracy analysis on just the faults detected with high probability. 

This procedure for measuring accuracy can understate accuracy for the multiple-fault scenarios. Consid- 
er a case in which the reasoner correctly isolates one fault (and does not isolate any other fault condi- 
tions) and correctly detects, with high probability, the second fault (and does not detect with high prob- 
ability any other fault conditions). If the two faults were injected in two single-fault scenarios, the out- 
come for isolating the first fault would be "very good" and the outcome for detecting the second fault 
with high probability would be "good." However, if the two faults were injected during the same test 
case, the reasoner would have a "very good" outcome for isolating the first fault but a "poor" outcome 
for the second fault, since in the presence of an isolated fault, we would not consider faults that 
were only detected with high probability. 

6.3.1 Accuracy of Flat and Hierarchical Models 

We classified the reasoner accuracy according to the following outcomes: 

• Inserted fault conditions isolated 

• Incorrect fault isolations 

• Inserted fault conditions detected with high probability (but not isolated) 

• Incorrect fault conditions detected with high probability 

• Missed fault conditions: inserted fault conditions that were not even detected 

We developed the following notation for marking these conditions: 


I 

Fault isolated. 

D 

Fault detected with high likelihood but not isolated. 

* 

Inserted fault. For example, *I indicates that the inserted fault was isolated. 

+ 

A fault that was not inserted. For example, I+ indicates a fault that was isolated but not inserted. 

M 

Missed fault condition. 


The '*' and '+' annotations can both be combined with the 'I' and 'D' annotations. For example, a test 
outcome with the annotation '*I+' indicates that the inserted fault was correctly isolated, but other fault 
conditions were incorrectly isolated as well. 
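
The scoring paradigm and the notation can be combined in a small labeling routine. The sketch below is illustrative only: the function name and set-based inputs are assumptions, and for multiple-fault cases it marks an outcome with '*' as soon as any inserted fault is indicted, which the report does not spell out.

```python
def annotate_outcome(inserted, isolated, detected):
    """Label one test case with the I, D, *, + and M notation.

    inserted: set of fault ids injected by the simulator
    isolated: set of fault ids whose ambiguity group ended in the Isolated state
    detected: set of fault ids detected with high likelihood but not isolated
    """
    # Isolated faults take precedence; detected faults are scored only
    # when no fault was isolated.
    if isolated:
        scored, tag = isolated, "I"
    elif detected:
        scored, tag = detected, "D"
    else:
        return "M"  # nothing was reported, so the inserted fault was missed

    hit = bool(scored & inserted)    # an inserted fault was indicted
    extra = bool(scored - inserted)  # a fault that was not inserted was indicted
    return ("*" if hit else "") + tag + ("+" if extra else "")

print(annotate_outcome({"f1"}, {"f1"}, set()))          # *I
print(annotate_outcome({"f1"}, set(), {"f1", "f3"}))    # *D+
print(annotate_outcome({"f1"}, {"f2"}, {"f1"}))         # I+
```

The last call shows the understatement discussed earlier: the inserted fault was detected, but because another fault was isolated, only the isolation is scored.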


We ran the hierarchical and flat reference models for the same set of evidence streams and got very 
similar accuracy results for the two models, as shown in Figure 13 below. 

Of the 220 single fault test cases, 38% ended with the best result: the inserted fault, and only the 
inserted fault, was isolated (the *I column in the figure). In an additional 5% of the test cases, the inserted 
fault was isolated, but other fault conditions were isolated as well. This outcome is less useful since, 
while it identifies the fault, it does so with ambiguity. 

Considering test cases that contained no fault isolations, in 37% only the inserted fault was detected 
with high likelihood (the *D column) and in an additional 18%, the inserted faults were among several 
fault conditions detected with high likelihood. 

In 2% of the test cases, the reasoner produced a result that would mislead the maintenance technician 
(the I+ column) by not including the inserted fault among its list of isolated faults. 

In summary, for the single fault case, the reasoner provided an unambiguous and correct result in 75% 
of the test cases and a correct but ambiguous result in an additional 23% of the test cases. 

[Chart: Single Fault Insertion, 220 outcomes; series: hierarchical, flat] 

Figure 13: Results of single fault simulations 



Outcomes for the multiple fault case were not quite as good, as shown in Figure 14 below. The outcome 
was correct and unambiguous in 45% of the test cases (columns *I and *D) and correct but ambiguous in 
31% of the test cases. In 14% of the test cases the reasoner indicted other fault conditions without 
indicting an inserted fault, and it failed to detect 5% of the fault insertions. 

[Chart: Multiple Fault] 

Figure 14: Results of multiple fault simulations 



6.3.2 False Alarms 

A false alarm is a fault detection caused by false evidence that appears in the evidence stream before 
fault insertion and that is not cleared by the reasoner by the end of the simulation. Twenty-one false 
alarms occurred in the 220 single fault test cases and no false alarms occurred during the 231 double 
fault test cases. Note that the 21 false alarms occurred in just three test cases; conversely no false 
alarms occurred in 448 of the 451 test cases. 

6.4 Isolating Node 

An important difference between the hierarchical and flat reference models is that reasoning is distrib- 
uted in the hierarchical model (reasoning can occur at any of the LRU, area and vehicle levels), while, for 
the flat model, reasoning is centralized at the vehicle level. This difference is reflected in the table be- 
low, which reports the vehicle architectural level for each fault isolation. This data combines results for 
both single and multiple fault insertion, and includes all isolations, not just isolations of the inserted 
faults. 

While isolation for the flat reference model will always occur at the 
vehicle level, for this hierarchical model, isolation almost always oc- 
curred at the LRU level. This may be an advantage since computing 
resources at the LRU level tend to be less expensive and more avail- 
able than at higher levels of the architecture. 



Level       Hierarchical    Flat 
LRU         519             - 
Area        1               - 
Vehicle     -               519 


6.5 Time to Isolate 

We measure the time to isolate in steps of the discrete event simulation. Each step was 10 seconds long, 
and we assumed that computation for one step completed before the start of the next step. This as- 
sumption held true when computing latency was less than 10 seconds. So for example, if a fault required 
three steps to isolate, and a step was 10 seconds long and computing latency was less than 10 seconds, 
then the fault would take 30 seconds to isolate. Time to isolate was the same for the two reference 
models. 

For the single fault test cases, the worst case time to isolate was 15 steps, and the average case was only 
0.6 steps. The wide difference between the average and worst cases occurs because most isolations oc- 
curred during the step when the fault was inserted. 

For the multiple fault test cases, isolation time was dramatically longer. The worst case time to isolate a 
fault was 153 steps and the average case was 13.6 steps. 

6.6 Safety Impact and Accuracy of Prognostics 

We tested two measures for prognostics. First, using the fault simulator, we tested the prognostic fu- 
sion accuracy. This measure tests the accuracy of the prognostics fusion rule. The fusion rule was found 
to be accurately implemented (example in Figure 15). 

Our second prognostic measure was to generate monitors to detect precursors to significant safety 
incidents such as in-flight engine shutdown. Figure 16 shows three such incidents. In the first incident, 
starting from the left of Figure 16, the precursors were detected onboard and isolated approximately 30 
flights early; VIPR could therefore give the maintainer time to intervene and prevent the safety 
incident. While the "discovered prognostic monitor" is very accurate, VIPR takes an "engine-wide view" 
and also suspects some secondary damage in the hot section, as defined in the manufacturer's FMEA. 

In the second case, the precursors were detected and isolated approximately 20 flights early. While the 
"discovered prognostic monitor" is very accurate, VIPR takes an "engine-wide view" and looks for 
more supporting evidence, such as that defined in the manufacturer's FMEA. 

VIPR conclusions from the third case show that the precursors were detected onboard with high 
likelihood. The precursors appear as different problems in the two engines. VIPR uses cascade reasoning 
and active query of the remaining two engines (#2 and #3) to identify a common cause: the fuel delivery 
manifold. This in turn suppresses the effect of a monitor with a high false alarm rate at the individual 
engine level. 

These three cases show the prognostic accuracy of VIPR case by case and have also illustrated how the 
VIPR approach detects precursors to safety incidents well in advance of the actual event (in-flight shut- 
down). 




Figure 15: Fusion of two prognostic vectors 

Figure 16: Discovery of prognostic monitors using data mining on an airline database 

VIPR prognostics do not predict the time when a failure will occur or otherwise provide an estimate of 
the remaining useful life of a component. 

6.7 Communications Volume 

Communications volume is characterized by the number of messages sent by the reasoner, and the size 
of the messages. Message format was defined by the ARINC 624 standard. Tabulations for the two ref- 
erence models under single and multiple fault scenarios are shown in Table 1. 

As expected, the distributed hierarchical model required more messages than the flat model to achieve 
equivalent fault detection and isolation results. For the single and double fault test cases, the hierar- 
chical model required 28% and 29% more messages, respectively, than were required for the flat refer- 
ence model. The messages for the hierarchical model tended to be a bit larger as well, and in total, the 
hierarchical model required the transmission of about 40% more bytes. 

The multiple fault test cases required significantly more messages than the single fault cases. On a per- 
inserted-fault basis, the double fault test cases sent 70% more messages per fault than the single fault 
cases, and twice as many bytes. This held true for both the hierarchical and flat models. 


Table 1: Communications volume for the two reference models under single and multiple fault conditions 

Model and scenario             Transactions        Bytes  Bytes/Tran  Trans/Fault  Bytes/Fault 
Hierarchical, single fault           40,972    7,907,861         193          178       34,382 
Flat, single fault                   32,053    5,594,980         175          139       24,326 
Hierarchical, multiple fault        142,304   31,805,968         224          308       68,844 
Flat, multiple fault                110,264   23,019,124         209          239       49,825 
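
The percentages quoted in the text can be reproduced from the transaction and byte totals in Table 1. The sketch below copies those totals from the table; the dictionary layout and function names are illustrative.

```python
# Transaction and byte totals copied from Table 1.
TABLE1 = {
    ("hierarchical", "single"):   (40_972, 7_907_861),
    ("flat", "single"):           (32_053, 5_594_980),
    ("hierarchical", "multiple"): (142_304, 31_805_968),
    ("flat", "multiple"):         (110_264, 23_019_124),
}

def percent_more(a, b):
    """How much larger a is than b, as a percentage of b."""
    return 100.0 * (a - b) / b

def extra_messages(scenario):
    """Percent more messages sent by the hierarchical model in one scenario."""
    hier = TABLE1[("hierarchical", scenario)][0]
    flat = TABLE1[("flat", scenario)][0]
    return percent_more(hier, flat)

def extra_bytes_total():
    """Percent more bytes sent by the hierarchical model over both scenarios."""
    hier = sum(b for (model, _), (_, b) in TABLE1.items() if model == "hierarchical")
    flat = sum(b for (model, _), (_, b) in TABLE1.items() if model == "flat")
    return percent_more(hier, flat)

print(round(extra_messages("single")), round(extra_messages("multiple")))
print(round(extra_bytes_total(), 1))
```

The message overheads round to 28% and 29%, and the combined byte overhead comes out slightly under 40%.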


6.8 Communications Latency 

Safety-critical aircraft communications systems, such as AFDX and ASCB, are designed with statically de- 
fined periodic schedules. Therefore, each message that can be sent during operational use of the aircraft 
must fit into a preallocated slot in the schedule for that message. This paradigm works well for systems 
that continuously produce information that must be propagated to other parts of the aircraft. 

In contrast to the periodic communications system, reasoner messaging is sporadic. The reasoner does 
not send messages when there is no evidence of faults or the fault evidence is unchanging. However, 
the arrival of evidence triggers activity that results in the transmission of a sequence of messages. 

6.8.1 Latency Model 

Computational latency is the time required to execute a transaction. In an avionics system, we expect 
the communication latency to be much longer than the processing latency, and hence we considered only 
the communications contribution to latency. 

The reasoners produce transactions at sporadic times. That is, when the monitors are not reporting any 
fault symptoms, the reasoners are quiet. However, when a fault occurs, one or more monitors may 
initiate transactions, the effect of which can then ripple through the network of reasoners. 

Avionics communications protocols tend to have periodic schedules that are static, and hence are not 
well-suited for handling sporadic communication. In a static, periodic schedule, every message that can 
be sent must be allocated a place in the predetermined schedule. If the message is not periodic and 
needs to be sent infrequently, then the message could be allocated a very small part of the communica- 
tions bandwidth. But when there is a flurry of activity, a small bandwidth can result in long latencies. Al- 
ternatively, the message could be allocated a larger bandwidth, but at a loss in communications utiliza- 
tion since the larger bandwidth would be used infrequently. 

This effect is illustrated in Figure 17, which shows the arrival of three messages, represented by the yel- 
low spikes, and message transmission bursts for two communications rates, shown in green and blue. 
Message transmission begins on the arrival of the first message and continues through the second mes- 
sage, because the transmission of the first is not completed when the second arrives— at either rate. 
The green rate is high enough to complete sending the second message before the arrival of the third, 
while the blue rate is not. Hence, when the third message is ready to be sent, the green rate begins 
sending it immediately, while the blue adds it to its transmission queue. 



Figure 17: A notional representation for latency when sending a burst of messages over a periodic 
communications system. 

We express communications latency by extrapolating from the durations of the message transmission 
bursts. In Figure 17, the green rate requires two bursts to transmit the three messages while the blue 
rate has just the one long burst. 

We measured both the average and maximum durations of bursts for a range of transmission rates and 
fault conditions and for the two reference models. While average burst length tends to be very low, the 
more important metric is the worst case burst, which can be quite long. 

For each fault scenario, the maximum burst was calculated for bandwidths of 10 bytes/second, 
100 bytes/second, 1000 bytes/second, etc., stopping when the maximum burst for a fault scenario was 
less than ten seconds long. The fault scenarios required a message transmission rate of 10,000 
bytes/second to achieve the worst case ten second threshold. 
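
The burst measurement and bandwidth sweep can be sketched as follows. This is a simplified single-queue model written for illustration, not the analysis code used for the report: messages are hypothetical (arrival time, size) pairs, the link is assumed to send one message at a time at a fixed byte rate, and the message list is assumed to be non-empty.

```python
def burst_durations(messages, bytes_per_second):
    """Return the duration of each transmission burst.

    messages: iterable of (arrival_time_seconds, size_bytes)
    A burst starts when a message arrives at an idle link and ends when the
    queue drains; a message that arrives during a burst extends that burst.
    """
    bursts = []
    burst_start = None
    busy_until = 0.0
    for arrival, size in sorted(messages):
        if burst_start is None or arrival > busy_until:
            if burst_start is not None:
                bursts.append(busy_until - burst_start)  # close the previous burst
            burst_start = arrival
            busy_until = arrival
        busy_until += size / bytes_per_second  # queue this message behind the others
    if burst_start is not None:
        bursts.append(busy_until - burst_start)
    return bursts

def min_bandwidth(messages, limit_seconds=10.0, start_rate=10):
    """Step the rate up by decades until the longest burst is under the limit."""
    rate = start_rate
    while max(burst_durations(messages, rate)) >= limit_seconds:
        rate *= 10
    return rate

# Three 200-byte messages, as in the notional example of Figure 17.
msgs = [(0.0, 200), (1.0, 200), (5.0, 200)]
print(burst_durations(msgs, 100))  # faster rate: two separate bursts
print(burst_durations(msgs, 50))   # slower rate: one long burst
print(min_bandwidth(msgs))
```

With these example messages, the faster rate finishes the second message before the third arrives and so produces two bursts, while the slower rate merges all three into one long burst.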


[Charts: Average Latency - 10 KB/s and Maximum Latency - 10 KB/s; categories: Hier 1 fault, Flat 1 fault, 
Hier 2 faults, Flat 2 faults] 

Figure 18: Communications latency computed from the single and multiple fault simulations 

6.9 Cost Analysis 


6.9.1 Cost Model 


The VIPR reasoner consists of a collection of entities that 
exchange messages. Messages tend to flow from the sen- 
sor monitors and up through the hierarchy of LRU, area 
and vehicle reasoner entities, and finally to a consumer of 
the vehicle health information, such as an aircraft display. 
This flow of information is depicted in Figure 19. 

Cost is computed as the sum of the costs for each reasoner 
transaction. A transaction embodies the computation and 
communications required to send a message from a source 
entity to a destination entity, as depicted in Figure 20. 


Figure 19: Fault information flows from the sensor monitors through the reasoner entities to the users 
of the reasoner's conclusions 

Figure 20: Cost is computed as the sum of transaction costs. 

A transaction includes half of the source entity's processing, the communication needed to transmit the 
message and half of the destination's processing. 

The model assumes that an entity's processing cost is the same for all transactions, and that communi- 
cations cost is proportional to the size of the message being transmitted. Each reasoner entity is as- 
signed these two cost parameters: 

1. P expresses the entity's processing cost. 

2. C expresses the entity's communications cost for transmitting/receiving a byte of information. 

The communications overhead for a transaction is assumed to be shared equally by source and 
destination, hence the cost of a transaction for a message containing M bytes is: 

$$\text{TransactionCost} = \frac{P_{\text{source}} + M \cdot C_{\text{source}}}{2} + \frac{P_{\text{destination}} + M \cdot C_{\text{destination}}}{2}$$

The ratio of P:C specifies the processing cost relative to the per byte communications cost. This ratio de- 
pends on a specific vehicle architecture. To gauge the cost over a range of values, we performed the 
cost analysis using ratios of 10, 100, 1000, and 10,000 to one (that is, the processing cost is equivalent to 
the cost of sending 10 bytes, 100 bytes, 1,000 bytes or 10,000 bytes, respectively). 

We assigned all entities at the same architectural level (LRU, Area, Vehicle and Display) the same values 
of P and C. We expected the cost of processing resources to be specific to the architecture of each vehi- 
cle, so we performed the analysis for these three cases: 

1. Cost does not vary by level (LRU = 1, Area = 1, Vehicle = 1, Display = 1) 

2. Cost increases linearly by level (LRU = 1, Area = 2, Vehicle = 3, Display = 4) 

3. Cost increases exponentially by level (LRU = 1, Area = 2, Vehicle = 4, Display = 8) 

For a given entity, the value of its C parameter is the value for the level that contains the entity. The val- 
ue of its P parameter is the P:C ratio selected for the simulation multiplied by the level value. For exam- 
ple, if cost is assumed to be linear across levels and the selected P:C ratio is 1,000, then the cost param- 
eters for a reasoner entity at the area level are: 

1. C = 2 

2. P = 2,000 

The cost for a transaction containing 200 bytes that is sent from an entity in the area level to an entity in 
the Display level is the average of the costs at the two levels. The cost expression above evaluates to 
3,600 for P:C = 1,000 and a linear increase in cost with level: 

$$\text{TransactionCost} = \frac{2{,}000 + 200 \cdot 2}{2} + \frac{4{,}000 + 200 \cdot 4}{2} = 1{,}200 + 2{,}400 = 3{,}600$$

Reasoner entities sometimes send messages to themselves. In this case, the communications cost is 
assumed to be zero, and we computed the processing cost entirely from the entity's processing cost metric. 
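
The cost model can be written out directly. The sketch below encodes the three level-weight cases and the transaction cost expression; the names `LEVEL_WEIGHTS`, `cost_params` and `transaction_cost` are illustrative, and the self-message case is interpreted as costing exactly one unit of the entity's processing cost P.

```python
# Level weights for the three cost cases described in the text.
LEVEL_WEIGHTS = {
    "constant":    {"LRU": 1, "Area": 1, "Vehicle": 1, "Display": 1},
    "linear":      {"LRU": 1, "Area": 2, "Vehicle": 3, "Display": 4},
    "exponential": {"LRU": 1, "Area": 2, "Vehicle": 4, "Display": 8},
}

def cost_params(level, scheme, pc_ratio):
    """Return (P, C) for an entity: C is the level weight, P is pc_ratio times it."""
    weight = LEVEL_WEIGHTS[scheme][level]
    return pc_ratio * weight, weight

def transaction_cost(src_level, dst_level, message_bytes, scheme, pc_ratio,
                     same_entity=False):
    """Cost of one transaction; source and destination each bear half."""
    p_src, c_src = cost_params(src_level, scheme, pc_ratio)
    if same_entity:
        return p_src  # message to self: no communications cost (assumed interpretation)
    p_dst, c_dst = cost_params(dst_level, scheme, pc_ratio)
    return (p_src + message_bytes * c_src) / 2 + (p_dst + message_bytes * c_dst) / 2

# Worked example from the text: 200 bytes, Area to Display, linear costs, P:C = 1,000.
example_cost = transaction_cost("Area", "Display", 200, "linear", 1000)
print(example_cost)
```

For the worked example this reproduces the value of 3,600 given in the text.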

6.9.2 Cost Analysis Results 

We combined the cost data for the 220 single fault insertions and 231 multiple fault insertions and com- 
pared the relative costs for the hierarchical and flat reference models. The cost advantage for the flat 
model is that it requires less communication than for the hierarchical model. The advantage for the hi- 
erarchical model is that it can have entities on any of the levels while all entities in the flat model exist at 
the vehicle level, which may be a more expensive computing resource. 

The charts in Figure 21 show the relative cost of the hierarchical model to the flat model over the 
combinations of the four P:C ratios and the constant, linear and exponential cost assumptions. 

Costs are shown relative to the lowest cost option: the flat model with P:C ratio = 10 and constant cost 
across levels of the architecture. Note that this cost value is the missing red column in the first chart: 
the value of the column is one, but when displayed on a logarithmic axis that starts with value 1, the 
length of the column is zero. 

As expected, the flat model is consistently cheaper when computing cost is constant across the system 
architecture because it requires fewer transactions. The hierarchical model is less costly for systems 
where computing costs are higher at the higher architectural levels. 

Figure 21: Relative cost of hierarchical and flat reference models 

For the P:C = 1000 case, the flat model is 32% less costly than the hierarchical model when costs are 
constant across architectural levels, but the hierarchical model is 24% less costly for a linear increase 
across levels and 51% less costly for an exponential increase across levels. 

Figure 22: Relative costs for the two models, assuming processing cost is equivalent to the 
communications cost for a 1000 byte message 

6.10 Prognostic: Time to Failure Metrics 

A measure of the value of a prognostic warning is how far into the future it can predict a fault. We 
studied the effect of these three monitors, reported in the airline database, for predicting the engine 
bleed fault: 

1. Rising start 

2. Fast start 

3. Fuel HMA 

The first two monitors are trend monitors; they predict the occurrence of a fault by tracking the changes 
in a parameter value over time. The third is a "super" diagnostic monitor that integrates several param- 
eter values to predict a looming failure. Unlike a typical diagnostic monitor, this monitor fires before the 
actual fault occurs, and hence is a prognosticator. Unlike trend monitors which use the change in a pa- 
rameter's value to predict a fault, a "super" monitor bases its prediction on the current values of several 
parameters. 

Figure 23 below shows the fault prediction based on the rising start trend monitor. 



Figure 23: Fault prediction using the Rising Start trend monitor 

The Y-axis in this figure indicates the relative likelihood of a fault occurrence against the likelihood that 
the fault will not occur. It is plotted on a log scale, so the 0 point indicates that there is an equal proba- 
bility that the fault will occur or not. 
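
To make the axis concrete, the conversion between a fault probability, the fault:no fault ratio, and the plotted log value can be sketched as below. The base-10 logarithm is an assumption (the text only says "log scale"), and the function names are illustrative.

```python
import math

def log10_odds(p_fault):
    """Plotted value: log10 of P(fault) / P(no fault), for 0 < p_fault < 1."""
    return math.log10(p_fault / (1.0 - p_fault))

def prob_from_odds(odds):
    """Convert a fault:no fault ratio such as 10 (meaning 10:1) to a probability."""
    return odds / (1.0 + odds)

print(log10_odds(0.5))                  # equal probability plots at 0
print(round(prob_from_odds(10.0), 3))   # a 10:1 ratio is a probability of about 0.909
```

A ratio of 10:1 therefore plots at 1 on this axis, and each further unit corresponds to another factor of ten in the ratio.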




The trend monitor first started predicting a fault about 30 flights before the actual fault occurrence. One 
flight later the fault:no fault ratio increased to 10:1, and 20 flights before the fault occurrence the likeli- 
hood increased to 13:1. 

The Fast Start trend monitor yielded a similarly shaped prediction, as shown in Figure 24. 



Figure 24: Fault prediction using the Fast Start trend monitor 

For this monitor, the prediction of a fault started about 10 flights earlier but yielded a somewhat lower 
likelihood than did the Rising Start trend monitor. 

Fusing the results of these two trend monitors improved the prediction by incorporating the earlier 
detection of the Fast Start monitor but also retained the lower likelihood of that monitor. The result of 
fusing the two trend monitors is shown in Figure 25. 

Figure 25: Fault prediction from fusing results from the Rising Start and Fast Start trend monitors 

The shape of the prediction curve for the Fuel HMA diagnostic monitor is different from the shape for 
the trend monitors, as shown in Figure 26. A diagnostic monitor fires when its preconditions are met, and 
in this case, the preconditions were met, and continued to be met for 27 flights before the failure. 

Figure 26: Fault prediction using the Fuel HMA diagnostic monitor 

The best results are achieved by fusing the results of all three monitors, as shown in Figure 27. The fused 
results begin predicting the fault nearly 40 flights before the occurrence and with a likelihood of greater 
than 2:1. By 25 flights before fault occurrence, the likelihood has increased to 10,000:1 and increases to 
100,000:1 by 20 flights before the fault occurrence. 

Figure 27: Fault prediction fusing all three monitors 

Another benefit of the fused result is that it yields fewer false alarms. The Fuel HMA monitor on its own 
is prone to generating false alarms. But the influence of the other two monitors, which do not by them- 
selves generate many false alarms, greatly reduces the incidence of false alarms. So the prediction from 
the fused results provides the longest forecast before incident, the highest confidence level, and a low 
false alarm rate. 

6.11 Reasoner Floating point Operations 

The number of floating point operations (FLOP) is calculated as follows: 

1. The calculation in VIPR is event-driven and consists of several steps. 

2. Two events, $T_1$ and $T_2$, drive the calculations: 

a. $T_1$: A member system provides VIPR with a new diagnostic or prognostic monitor. 

b. $T_2$: The active query receives a parametric value from any of the aircraft member systems. 

Only five steps within VIPR perform floating point operations. Other operations, such as comparisons, 
array indexing or messaging, are excluded from the FLOP calculation. These steps are labeled 
$S_1, S_2, S_3, S_4, S_5$ as follows: 

1. $S_1$: Prognostic monitor generation. This step occurs only when the calculations are driven by $T_1$. 
Since VIPR supports four mechanisms, this step is classified as follows: 

a. Condition indicator-based - options include linear and hidden-state trending. 

b. Exceedance-based - options include simple counting and latched-counting. 

The number of floating point operations is a function of the buffer size (the number of samples 
in the trend history window) and the prediction size (the number of samples in the prediction 
window). Since both these include the calculation of statistical standard deviation, the FLOP 
count is a function of $W^2 + B^2$, where $W$ is the prediction window size and $B$ is the buffer size. 
Typically, the number of points in the buffer window and the trend window will be the same 
order of magnitude. We approximate this as $O(B^2)$. Further, all evidence defined in the reference 
model may have a prognostic monitor. In this case, these operations will be repeated $O(E)$ 
times, where $E$ is the number of evidence items defined in the VIPR reference model. Hence, the 
FLOP count for this step is $O(EB^2)$. 

2. $S_2$: Prognostic vector fusion. The fusion process includes linear interpolation for lining up two 
vectors and obtaining a mean at each piecewise-interpolated point. An interpolation function 
consumes $O(B)$ FLOPs, where we use the same notation $B$ to indicate the average number of 
points in the trend window and, hence, the samples in the prognostic window. Further, the 
fusion operation is distributive, which makes the fusion independent of the order in which two 
vectors are fused. In the worst case, if the reference model defines $E$ evidence items, we need 
$(E - 1)$ fusions at each step. Hence, the FLOP count for this step is $O(EB)$. 

3. $S_3$: Hypothesis likelihood update. This uses a noisy-OR Bayesian model to calculate the posterior 
probabilities for the various fault condition hypotheses. Specifically, VIPR calculates the 
log-likelihood values for the various fault condition hypotheses. If the reference model defines $E$ 
evidence items, in the worst case, a given failure mode can be connected to every one of them. 
Further, each evidence item has a detection probability and a false alarm probability, requiring 3 
floating point calculations per evidence item connected to a failure mode. Hence, the FLOP count for 
the noisy-OR calculation per failure mode hypothesis is $O_{\log}(3E)$; here, the subscript log indicates 
that the number must be multiplied by the FLOP required to perform a natural logarithmic function 
calculation. 

VIPR performs the noisy-OR calculation for both single and two-fault hypotheses. In the worst 
case, all $F$ failure modes defined in the reference model may be occurring simultaneously. In 
this case, we will have $\frac{F(F+1)}{2}$ hypotheses for which the noisy-OR calculations must happen, 
which implies we will end up with $O_{\log}(F^2 E)$ floating point operations for this step. 

4. $S_4$: Likelihood normalization. Normalizing is done only for single fault hypotheses. It involves 
subtracting a minimum value and dividing the result by a maximum value, that is, three FLOPs per 
single-fault hypothesis. Hence, this step will need $O(F)$ FLOPs. 

5. $S_5$: FM distance. New failure modes are assigned to the ambiguity group of a fault condition 
using a pairwise distance calculation. Since the $F$ failure modes are pre-defined in the 
reference model, there can be only $\frac{F(F+1)}{2}$ pairs; hence, this step will need $O(F^2)$ FLOPs every 
time a change is made to the reference model and not every VIPR update cycle. 
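
The interpolate-and-average fusion described for step $S_2$ can be sketched for two vectors as follows. This is an illustration under stated assumptions, not the VIPR fusion rule itself: each prognostic vector is assumed to be a list of (x, value) points with strictly increasing x, the common grid is the union of both sets of x values, and values outside a vector's range are clamped to its end values.

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation, clamped to the end values outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def fuse(vec_a, vec_b):
    """Line up two prognostic vectors on a common grid and average them pointwise."""
    xs_a, ys_a = zip(*vec_a)
    xs_b, ys_b = zip(*vec_b)
    grid = sorted(set(xs_a) | set(xs_b))
    return [(x, 0.5 * (interp(x, xs_a, ys_a) + interp(x, xs_b, ys_b))) for x in grid]

a = [(0, 0.0), (10, 1.0)]
b = [(0, 1.0), (5, 1.0), (10, 1.0)]
fused = fuse(a, b)
print(fused)
```

Each call to `interp` walks at most the length of one vector, so fusing two vectors of about B points in this naive form is not optimized; a merge-style pass over both sorted grids would bring the per-fusion work down to the linear cost assumed in the text.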

Summarizing from the above calculations, we conclude that the number of FLOP per VIPR update step is 
bounded by an upper limit. The order of magnitude of this upper limit is: 

$$\mathrm{VIPR}_{\mathrm{FLOP}} < O_{\log}(F^2 E) + O(EB^2)$$

Here: 


1. $F$ is the number of failure modes defined in the reference model, $E$ is the number of evidence items 
defined, and $B$ is the number of samples used for generating prognostic monitors. Further, we 
assume that the prediction window for VIPR will be $O(B)$. 

2. $O_{\log}$ denotes the order of magnitude for performing a single precision logarithmic calculation. 
The exact number of steps will depend on the processor architecture. 
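
The hypothesis count and the per-hypothesis noisy-OR cost behind the $O_{\log}(F^2 E)$ term can be illustrated with a small sketch. The report does not give VIPR's exact likelihood formulation, so the code below uses a textbook noisy-OR model as a stand-in; all names and the example probabilities are hypothetical, and every probability is assumed to lie strictly between 0 and 1.

```python
import math
from itertools import combinations

def hypotheses(failure_modes):
    """All single-fault and two-fault hypotheses: F + F(F-1)/2 = F(F+1)/2 of them."""
    singles = [(fm,) for fm in failure_modes]
    doubles = list(combinations(failure_modes, 2))
    return singles + doubles

def noisy_or_log_likelihood(hypothesis, observed, detect_prob, false_alarm):
    """Log-likelihood of the observed evidence under a textbook noisy-OR model.

    hypothesis:  tuple of failure modes assumed present
    observed:    dict evidence -> True if the monitor fired
    detect_prob: dict (failure_mode, evidence) -> detection probability
                 (missing pairs are treated as unconnected)
    false_alarm: dict evidence -> false alarm probability
    """
    total = 0.0
    for evidence, fired in observed.items():
        # Probability that the evidence stays quiet: no false alarm and
        # no detection by any failure mode in the hypothesis.
        p_quiet = 1.0 - false_alarm[evidence]
        for fm in hypothesis:
            p_quiet *= 1.0 - detect_prob.get((fm, evidence), 0.0)
        total += math.log(1.0 - p_quiet if fired else p_quiet)
    return total

modes = ["A", "B", "C", "D"]
observed = {"e1": True, "e2": False}
detect = {("A", "e1"): 0.9, ("B", "e2"): 0.9}
alarms = {"e1": 0.01, "e2": 0.01}

print(len(hypotheses(modes)))  # 4 * 5 / 2 hypotheses
ll_a = noisy_or_log_likelihood(("A",), observed, detect, alarms)
ll_b = noisy_or_log_likelihood(("B",), observed, detect, alarms)
print(ll_a > ll_b)
```

Each hypothesis touches every evidence item once, and there are $F(F+1)/2$ hypotheses, which is where the product of $F^2$ and $E$ in the bound comes from.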


The run-time computations of VIPR described above are summarized in Table 2. 

Table 2: Floating point operations within VIPR and its upper bound. 

Assumptions                           FLOPS                           Notes 
$S_1$: CI-based prognostic            $O(EB^2)$                       Per VIPR update cycle 
$S_2$: Prognostic vector fusion       $O(EW)$                         Per VIPR update cycle 
$S_3$: Hypothesis likelihood update   $O_{\log}(F^2 E)$               Per VIPR update cycle 
$S_4$: Likelihood normalization       $O_{\log}(F^2 E)$               Per VIPR update cycle 
$S_5$: FM distance                    $O(F^2)$                        Per new reference model load 
Total                                 $< O_{\log}(F^2 E) + O(EB^2)$   Upper bound 


6.12 Other Metrics 

The VIPR software contains about 7000 lines of MATLAB code, half of which implements the reasoner. 
The remaining code is split among messaging, monitors, parsing the reference model (about 1000 lines 
each), and the fault simulator (500 lines). For a reference model that defines N failure modes, the
reasoner will allocate data structures that consume $O(N^2)$ storage space.

Other software metrics, such as code complexity and links between software components, are
meaningful mainly for a production-grade implementation and were not computed for the prototype
VIPR software.

7 Metrics Derived from the VIPR Hardware-in-the-Loop 
Demonstration 

The VIPR program included a hardware-in-the-loop (HIL) demonstration (Figure 28) that featured Hon- 
eywell's LaserRef VI inertial reference unit (IRU). The demonstration configuration allowed us to expand 
the metrics analysis to include data from commercial avionics equipment. 
Figure 28: The logical architecture of the VIPR Hardware-in-the-Loop demonstration.


A version of the LaserRef VI IRU, known as the acceptance test vehicle (ATV), was used in the demo. An 
ATV is a LaserRef VI without sensors for measuring linear and angular acceleration. An ATV is used in 
place of a LaserRef VI during product development when providing test sensor data from a file is more 
desirable than using the actual sensors. 

The demo consisted of providing the VIPR reasoner with fault evidence from the fault simulator, the air- 
line database, and the ATV. An evidence stream from the ATV could be received either from the ATV in 
real-time or replayed from a file of recorded ATV output. The demo configuration contained evidence 
streams for two IRUs, one of which had to be the replay of a recorded ATV output file and the other 
could either be the ATV producing the evidence in real-time or a second replay. 

The LaserRef VI output consists of about 30 parameters for expressing the navigation solution and de- 
vice status. The output is formatted as ARINC 429 words and transmitted at a 50 Hz rate. For the demo, 
we replaced the ARINC 429 interface with Ethernet and used it to send a data structure containing about 
half of the LaserRef VI outputs. Because the PC on the receiving end could not keep up with the LaserRef 
VI 50 Hz output rate, we reduced the transmission rate to 25 Hz. 

ATV output processing occurred in an output process that gathered the data, packed the data into a
structure, and sent the data to a TCP protocol stack for transmission to the PC. Using a development
tool provided for the ATV, we measured the CPU time of the output process, plus the incremental
overhead on the TCP network processes for sending our data structure, at about 4% of the processor's
capacity. This value is consistent with the performance of the LaserRef VI, which for production usage
is allocated 17% of the CPU throughput. This larger budget must support output at 50 Hz instead of
25 Hz and the full complement of LaserRef VI outputs, whereas our configuration transmitted only
about half of them. However, since the output used in the demo is already being sent for use by other
aircraft systems (the FMS in particular), VIPR would impose no additional processing load on the
LaserRef VI.
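
The consistency between the measured 4% and the 17% production budget can be checked with a rough calculation. The sketch below assumes, purely for illustration, that output processing cost scales linearly with both the transmission rate and the fraction of outputs sent; the function name is hypothetical.

```python
def scaled_cpu_share(budget_share, rate_ratio, output_fraction):
    """Scale a CPU budget linearly by output rate and fraction of outputs sent."""
    return budget_share * rate_ratio * output_fraction


# 17% production budget, 25 Hz instead of 50 Hz, about half of the outputs.
estimate = scaled_cpu_share(0.17, 25 / 50, 0.5)
print(f"{estimate:.2%}")
```

Under this linear-scaling assumption the estimate comes to a little over 4%, in line with the measured value.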

The evidence stream generated by the ATV had high fidelity and did not compromise the reasoner's ac- 
curacy. Although the simulation environment used for gathering metrics data did not include the ATV, 
evidence received from the ATV after processing by the diagnostic monitors appears to the reasoner the 
same as monitor input from evidence generated by the fault simulator. Hence, adding the ATV to the 
simulation environment would not have affected results, but would have greatly complicated the pro- 
cess of gathering the metrics because of the difficulty of performing Monte Carlo simulations using the 
ATV. In addition, the ATV's essential need to run in real-time would have significantly slowed the data 
gathering, since simulations that don't include the ATV can run much faster than real-time. 

8 Summary and Conclusions 

We have computed metrics that measure the reasoner's accuracy, latency, communications bandwidth,
computational cost, and rate of false alarms. Data was gathered from the insertion of 22 complex
faults in both single and multiple fault cases, and for both hierarchical and flat aircraft reference models.

For each single fault condition, we generated 10 evidence streams for the inserted fault that also con- 
tained 0.1% erroneous fault evidence and then ran simulations for each evidence stream. For each of 
the 231 two fault insertion cases, we generated a single evidence stream containing 0.1% erroneous ev- 
idence and ran simulations for each of these evidence streams. A total of 902 simulations containing 
1364 fault conditions were run. 
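
The totals quoted above can be reproduced with a short calculation. The sketch assumes that the 231 two-fault cases are all distinct pairs of the 22 faults, and that every evidence stream was simulated once against each of the two reference models; both assumptions are inferred from the numbers rather than stated explicitly.

```python
from math import comb

NUM_FAULTS = 22            # complex faults inserted
STREAMS_PER_SINGLE = 10    # evidence streams per single-fault condition
NUM_MODELS = 2             # hierarchical and flat reference models

single_runs = NUM_FAULTS * STREAMS_PER_SINGLE   # single-fault evidence streams
double_runs = comb(NUM_FAULTS, 2)               # distinct two-fault combinations

total_simulations = NUM_MODELS * (single_runs + double_runs)
total_fault_conditions = NUM_MODELS * (single_runs * 1 + double_runs * 2)

print(total_simulations, total_fault_conditions)
```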

From the simulation results, we conclude the following: 

• Accuracy The reasoner's ability to correctly isolate a failure from a given evidence stream
is dependent on the quality of the evidence stream and the correctness of the
reference model. When simulated with evidence streams that contained 0.1%
erroneous data, the reasoner correctly and exactly identified the inserted faults
in 75% of the single fault cases and 45% of the multiple fault cases. In addition, it
correctly identified the inserted faults, but not uniquely, for an additional 23%
of the single fault insertions and 31% of the multiple fault cases. Therefore, the
reasoner correctly identified 98% of the single fault insertions and 76% of the
multiple fault insertions.

Accuracy was the same for the hierarchical and flat reference models.

The prognostics were accurate on three cases discovered from the airline database.
The case studies on prognostic precursors have shown that the VIPR approach
detects precursors to safety incidents multiple flights in advance of the
actual event (in-flight shutdown).

• Isolation time For the single fault test cases, the fault was typically isolated immediately after
fault insertion. The worst-case time to isolate was 15 steps and the average time
was 0.6 steps.

The time to isolate for the multiple fault scenarios was considerably longer
than for the single fault insertions: the worst-case time to isolate was 153 steps
(about 10 times longer) and the average time was 13.6 steps (more than 20 times
longer).

For sending the ARINC 624 messages generated by the reasoner over a periodic
safety-critical communications system, we computed that a bandwidth of
1 KB/second would yield, on average, a message latency of 1-3 seconds, depending
on the reference model and the number of faults inserted. However, reducing the
worst-case latency below the 10 second goal for all simulations required a
communications bandwidth of 10 KB/second.

For the flat model, the isolating reasoner entity is always in the vehicle node.
However, for a hierarchical model, reasoning may occur at any node. For our
aircraft hierarchical model, isolation occurred at the LRU level 519 times, once
at the Area level, and never at the Vehicle level.

The flat model required 22% fewer messages and 28% fewer bytes in total than 
did the hierarchical model. 

Where computation cost is the same at all aircraft levels (LRU, area, and vehicle),
the computation cost for the flat model was lower than for the hierarchical
model because the flat model requires fewer transactions to achieve the same
results as the hierarchical model.

However, where computing at higher nodes is more expensive than at the lower
levels, the distributed hierarchical model is less costly because it can perform
much of its computation on the lower cost computing resources.

The cost of computation for the reasoner is proportional to the log of the number
of faults squared and to the square of the number of samples used for generating
prognostic monitors.

The 0.1% rate of false evidence generated false alarms in only three of the 902 
simulations run. 

The high quality of the LaserRef VI self-diagnostics allowed us to use the de- 
vice's existing output for input to the VIPR software; consequently, integrating 
the LaserRef VI with VIPR added no overhead to the device's operation. 
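
The trade-off between the flat and hierarchical models described above can be illustrated with a small cost model. This is a sketch only: it assumes that message counts are a proxy for reasoning transactions, that the hierarchical model does essentially all of its work at the LRU level, and that the per-transaction unit costs are hypothetical values rather than measurements from the VIPR simulations.

```python
def relative_cost(transactions_by_level, unit_cost_by_level):
    """Total cost = sum over levels of (transactions at level) x (unit cost)."""
    return sum(
        count * unit_cost_by_level[level]
        for level, count in transactions_by_level.items()
    )


# Hypothetical workloads: the flat model needs 22% fewer transactions,
# but runs all of them at the vehicle node.
hierarchical = {"lru": 100.0}
flat = {"vehicle": 78.0}

equal_cost = {"lru": 1.0, "vehicle": 1.0}
costly_vehicle = {"lru": 1.0, "vehicle": 2.0}   # hypothetical 2x vehicle-node cost

print(relative_cost(flat, equal_cost) < relative_cost(hierarchical, equal_cost))
print(relative_cost(hierarchical, costly_vehicle) < relative_cost(flat, costly_vehicle))
```

Under these assumptions the flat model is cheaper when unit costs are equal, and the hierarchical model becomes cheaper once a vehicle-node transaction costs more than roughly 1.28 times an LRU-level transaction (100/78).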